World space is a very useful intermediary between camera space and model space. It makes it easy to position cameras and so forth. But there is a lingering issue when dealing with world space directly: the problem of large worlds and numerical precision.
Let us say that you're trying to model a very large area down to fairly fine detail. Your units are inches, and you want precision to within 0.25 inches. You want to cover an area with a radius of 1,000 miles, or 63,360,000 inches.
Let us also say that the various pieces of this world all have their own model spaces and are transformed into their appropriate positions via a model-to-world transformation matrix. So the world is assembled out of various parts. This is almost always true to some degree.
Let us also say that, while you do have a large world, you are not concerned about rendering all of it at any one time. The part of the world you're interested in is the part within view from the camera. And you're not interested in viewing incredibly distant objects; the far depth plane is going to cull out the world beyond a certain point from the camera.
The problem is that a 32-bit floating-point number can only hold about 7 decimal digits of precision (a 24-bit significand). So towards the edges of the world, at around 63,000,000 inches, adjacent representable values are 4 inches apart. This means that vertex positions closer together than this will not be distinct from one another. Since your world is modeled down to 0.25 inches of precision, this is a substantial problem. Indeed, even if you go out to only 6,000,000 inches, ten times closer to the middle, the spacing is still 0.5 inches, which is coarser than the tolerance you need.
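To see this granularity for yourself, you can ask the C++ standard library how far apart neighboring floats are at a given magnitude. The following is a minimal sketch; the function name spacingAbove is made up for this illustration, and it assumes IEEE 754 32-bit floats, which is what virtually all current hardware provides.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable 32-bit float above it.
// Two positions closer together than this may collapse into one value.
float spacingAbove(float x)
{
    const float next =
        std::nextafter(x, std::numeric_limits<float>::infinity());
    return next - x;
}
```

Under that assumption, spacingAbove(63360000.0f) is 4.0f and spacingAbove(6000000.0f) is 0.5f, both coarser than the 0.25 inches we want.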
One solution, if you have access to powerful hardware capable of OpenGL 4.0 or better, is to use double-precision floating-point values for your matrices and shader values. Double-precision floats, 64 bits in size, give you about 16 digits of precision, which is enough to measure the size of atoms in inches at more than 60 miles away from the origin.
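The same kind of check can be made for doubles. This is only a sketch under the assumption of IEEE 754 64-bit doubles; spacingAboveDouble is a hypothetical helper name, and 3,801,600 is simply 60 miles expressed in inches.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable 64-bit double above it.
double spacingAboveDouble(double x)
{
    const double next =
        std::nextafter(x, std::numeric_limits<double>::infinity());
    return next - x;
}
```

At 3801600.0 the spacing is 2^-31 inches, a bit under half a billionth of an inch, which is finer than the rough scale of an atom.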
However, you would be sacrificing a lot of performance to do this. Even though the hardware can do double-precision math, it loses quite a bit of performance in doing so (anywhere between 25% and 75% or more, depending on the GPU). And why bother, when the real solution is much easier?
Let's look at our shader again.
    #version 330

    layout(location = 0) in vec4 position;

    uniform mat4 cameraToClipMatrix;
    uniform mat4 worldToCameraMatrix;
    uniform mat4 modelToWorldMatrix;

    void main()
    {
        vec4 worldPos = modelToWorldMatrix * position;
        vec4 cameraPos = worldToCameraMatrix * worldPos;
        gl_Position = cameraToClipMatrix * cameraPos;
    }
The position is relatively close to the origin, since model coordinates tend to be close to the model space origin. So you have plenty of floating-point precision there. The cameraPos value is also close to the origin. Remember, the camera in camera space is at the origin. The world-to-camera matrix simply transforms the world to the camera's position. And as stated before, the only parts of the world that we are interested in seeing are the parts close to the camera. So there's quite a bit of precision available in cameraPos.
And in gl_Position, everything is in clip-space, which is again relative to the camera. While you can have depth buffer precision problems, that only happens at far distances from the near plane. Again, since everything is relative to the camera, there is no precision problem.
The only precision problem is with worldPos. Or rather, in the modelToWorldMatrix.
Think about what modelToWorldMatrix and worldToCameraMatrix must look like regardless of the precision of the values. The model to world transform would have a massive translational component. We're moving from model space, which is close to the origin, to world-space which is far away. However, almost all of that will be immediately negated, because everything you're drawing is close to the camera. The camera matrix will have another massive translational component, since the camera is also far from the origin.
This means that, if you combined the two matrices into one, you would have one matrix with a relatively small translation component. Therefore, you would not have a precision problem.
Now, 32-bit floats on the CPU are no more precise than on the GPU. However, on the CPU you are guaranteed to be able to do double-precision math. And while it is slower than single-precision math, the CPU is not doing as many computations. You are not doing vector/matrix multiplies per vertex; you're doing them per object. And since the final result would actually fit within 32-bit precision limitations, the solution is obvious: combine the two matrices on the CPU in double precision.
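To make the cancellation concrete, consider just one axis. The sketch below computes an object's offset from the camera in two ways; the function names and the sample coordinates are invented for this illustration, and the results quoted afterwards assume IEEE 754 arithmetic.

```cpp
// Offset of an object from the camera along one axis.
// Both inputs are world-space coordinates in inches.

// Everything in 32-bit floats: each input is rounded to the nearest
// float before the subtraction, so detail is lost up front.
float offsetSingle(double objectX, double cameraX)
{
    const float objectF = static_cast<float>(objectX);
    const float cameraF = static_cast<float>(cameraX);
    return objectF - cameraF;
}

// Subtract in doubles first, then round only the small result.
float offsetDouble(double objectX, double cameraX)
{
    return static_cast<float>(objectX - cameraX);
}
```

With an object at 63360000.25 and a camera at 63359991.0, the true offset is 9.25 inches. offsetDouble returns exactly 9.25f, while offsetSingle returns 8.0f, because both inputs were snapped to multiples of 4 before the subtraction.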
The take-home point is this: avoid presenting OpenGL with an explicit model-to-world matrix. Instead, generate a matrix that goes straight from model space to camera space. You can use double-precision computations to do this if you need to; simply convert the result back to single precision when uploading the matrix to OpenGL.
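Here is one way this could look. It is a sketch rather than a definitive implementation: the types Mat4d and Mat4f, the helper functions, and the uniform name modelToCameraMatrix are all assumptions made for this example, and a real application would more likely use its math library's double-precision matrix type.

```cpp
#include <array>
#include <cstddef>

// Column-major 4x4 matrices, the layout OpenGL expects by default.
using Mat4d = std::array<double, 16>;
using Mat4f = std::array<float, 16>;

// Returns a * b, so that (a * b) * v applies b first and then a.
Mat4d multiply(const Mat4d& a, const Mat4d& b)
{
    Mat4d result{};
    for (int col = 0; col < 4; ++col)
    {
        for (int row = 0; row < 4; ++row)
        {
            double sum = 0.0;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            result[col * 4 + row] = sum;
        }
    }
    return result;
}

// Pure translation matrix; the offsets live in the last column.
Mat4d translation(double x, double y, double z)
{
    return Mat4d{1, 0, 0, 0,
                 0, 1, 0, 0,
                 0, 0, 1, 0,
                 x, y, z, 1};
}

// Combine the two transforms in double precision, then narrow the
// (now small) values to floats, ready to be uploaded as a mat4 uniform.
Mat4f modelToCamera(const Mat4d& worldToCamera, const Mat4d& modelToWorld)
{
    const Mat4d combined = multiply(worldToCamera, modelToWorld);
    Mat4f result{};
    for (std::size_t i = 0; i < combined.size(); ++i)
        result[i] = static_cast<float>(combined[i]);
    return result;
}

// Vertex shader with the two matrices merged into a single uniform.
// The uniform name modelToCameraMatrix is an assumption of this sketch.
const char* const kVertexShaderSource = R"glsl(#version 330

layout(location = 0) in vec4 position;

uniform mat4 cameraToClipMatrix;
uniform mat4 modelToCameraMatrix;

void main()
{
    vec4 cameraPos = modelToCameraMatrix * position;
    gl_Position = cameraToClipMatrix * cameraPos;
}
)glsl";
```

For example, modelToCamera(translation(-63359991.0, 0, 0), translation(63360000.25, 0, 0)) yields a matrix whose x translation is 9.25, with nothing lost. The resulting 16 floats can be handed to glUniformMatrix4fv with the transpose argument set to GL_FALSE, since they are already column-major.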